library(tidyverse)
library(plotly)
library(shiny)
library(flexdashboard)
library(viridis)
data_clean <- read.csv("data_final.csv")

Project Name: Exploring NYC Shooting Data

Group Leader: Zebang Zhang (zz3309)

Other Members: Chenhui Yan (cy2772), Zhaokun Lin (zl3544), Mingyin Wang (aw3693)

Topic

In this exploratory study, we aim to analyze the geographic and temporal distribution of shootings in New York City, with an emphasis on how socioeconomic factors influence these patterns.

Motivation

Understanding the dynamics of shooting incidents in New York City is crucial for enhancing public safety and fostering resilient communities. By examining demographic patterns in shooting crime data alongside datasets on high school graduation rates and poverty levels, our project aims to identify the socioeconomic factors that influence gun violence. Analyzing how disparities in education and income relate to shooting incidents provides valuable insights for targeted interventions.

Introduction

Gun violence is a significant public health and social issue New York City, where diverse communities and complex socioeconomic conditions create a unique landscape for understanding its contributing factors. This project examines the geographic, temporal, and socioeconomic patterns of shootings in NYC to identify critical influences on gun violence. We explore how shooting incidents are distributed across boroughs and neighborhoods and examine temporal trends, such as seasonal, monthly, and time-of-day variations. Additionally, we investigate the relationship between education and gun violence, focusing on whether lower high school graduation rates correlate with higher shooting prevalence. Finally, we analyze the connection between poverty and gun violence to determine if neighborhoods with higher poverty levels experience increased incidents of shootings.

Significance

This research will aid in informing targeted public health and safety interventions by identifying the communities most impacted by shootings. Understanding these patterns will enable policymakers to implement more effective violence prevention measures and allocate resources where they are needed the most.

Research Questions

Our project focuses on examining the socioeconomic and temporal factors that influence gun violence in New York City. Specifically, we aim to answer the following questions:

Geographic and Temporal Patterns of Shootings in NYC: - How are shooting incidents distributed geographically across different boroughs and neighborhoods in New York City? - What are the temporal patterns of shootings (e.g., month, season, time of day) in different areas of NYC?

Education and Gun Violence: - Is there an association between high school graduation rates and the prevalence of shootings across NYC neighborhoods? - Are neighborhoods with lower high school graduation rates more likely to experience higher rates of gun violence?

Poverty and Gun Violence: - How does the percentage of people living below the poverty line relate to shooting incidents across neighborhoods in NYC? - Are higher poverty levels associated with an increased frequency of shootings?

Data source

Original data

This report analyzes shooting incidents in NYC. You can find the original data before merging here. This data is from NYCOpenData, consists some basic information: occur time, coordinate of each incident, victim’s information, .etc.

data for additional information

  • NTA shape data

Neighborhood Tabulation Areas (NTAs) are medium-sized statistical geographies for reporting Decennial Census and American Community Survey. NTAs were delineated with the need for both geographic specificity and statistical reliability in mind. Shapefile of NTA is get from NYC Planning website, you can find the data here, or you can download

  • CDTA data

The Department of City Planning (DCP) created Community District Tabulation Areas (CDTAs) to closely approximate the 59 Community Districts of New York City for the purpose of reporting American Community Survey (ACS) data. You can find the data here, or you can download

  • Borough Boundaries data

The Borough Boundaries dataset from the NYC Open Data portal provides detailed information about the geographical boundaries of the five boroughs of New York City: Bronx, Brooklyn, Manhattan, Queens, and Staten Island. You can find the data here

  • Neighborhood data

Here is neighbourhood list for geo filter. Sourced from city or open source GIS files. Here is GeoJSON file of neighbourhoods of the city.

  • Education data

The data source for education is shown here.

  • Poverty data

The data source for poverty is here

Data processing

The original data only consists of some basic information and doesn’t show NTA, CDTA, Borough and neighborhood information of each incident, so we added these information to the original data base on the coordinate of each incident, we also add some demographic information: population, education level, poverty level, .etc to the data.

Here’s how we added the information based on all the data we have:

  • enrich NYC shooting incident data with neighborhood-level information by performing spatial operations, converts the shooting data into a spatial format, joins it with neighborhood boundaries, and adds additional neighborhood attributes for deeper spatial analysis.
    detailed code
    df=read_csv("./data/NYPD_Shooting_Incident_Data__Historic__20241119.csv")
    
    # Ensure coordinate columns are numeric
    df = df %>%
      mutate(
        X_COORD_CD = as.numeric(X_COORD_CD),
        Y_COORD_CD = as.numeric(Y_COORD_CD)
      )
    
    # Convert to sf object with CRS EPSG:2263
    df_sf = st_as_sf(df, coords = c("X_COORD_CD", "Y_COORD_CD"), crs = 2263)
    
    # Download and read the GeoJSON file
    download.file(
      url = "https://data.insideairbnb.com/united-states/ny/new-york-city/2024-09-04/visualisations/neighbourhoods.geojson",
      destfile = "neighbourhoods.geojson",
      mode = "wb"
    )
    neighbourhoods = st_read("neighbourhoods.geojson")
    
    # Check and transform CRS of neighborhoods
    if (st_crs(neighbourhoods)$epsg != 2263) {
      neighbourhoods = st_transform(neighbourhoods, crs = 2263)
    }
    
    # Perform the spatial join
    df_joined = st_join(df_sf, neighbourhoods, left = TRUE)
    
    # Add neighborhood information to the data frame
    df$Neighborhood = df_joined$neighbourhood
    # Download and use the neighborhoods CSV file
    download.file(
      url = "https://data.insideairbnb.com/united-states/ny/new-york-city/2024-09-04/visualisations/neighbourhoods.csv",
      destfile = "neighbourhoods.csv",
      mode = "wb"
    )
    neighborhoods_csv = read_csv("neighbourhoods.csv")
    
    # Merge additional attributes if necessary
    df = df %>%
      left_join(neighborhoods_csv, by = c("Neighborhood" = "neighbourhood"))
  • enrich NYC shooting incident data with neighborhood-level information by performing spatial operations, converts the shooting data into a spatial format, joins it with neighborhood boundaries (both 2020 and 2010 shapefiles), and adds additional neighborhood attributes for deeper spatial analysis.
    detailed code
    df$X_COORD_CD <- as.numeric(df$X_COORD_CD)
    df$Y_COORD_CD <- as.numeric(df$Y_COORD_CD)
    
    # Convert to sf object
    df_sf <- st_as_sf(df, coords = c("X_COORD_CD", "Y_COORD_CD"), crs = 2263)
    
    # Read neighborhood shapefile
    nta <- st_read("./data/nynta2020_24d/nynta2020.shp")
    
    # Transform CRS if necessary
    nta <- st_transform(nta, crs = st_crs(df_sf))
    
    # Perform spatial join
    df_joined <- st_join(df_sf, nta, left = TRUE)
    
    # Add neighborhood information to your original data frame
    df$NTA <- df_joined$NTAName  # Adjust based on actual column name
    
    ## add 2010 nta data
    df$X_COORD_CD <- as.numeric(df$X_COORD_CD)
    df$Y_COORD_CD <- as.numeric(df$Y_COORD_CD)
    
    # Convert to sf object
    df_sf <- st_as_sf(df, coords = c("X_COORD_CD", "Y_COORD_CD"), crs = 2263)
    
    # Read neighborhood shapefile
    neighborhoods_2010 <- st_read("./data/nynta2010_24d/nynta2010.shp")
    
    # Transform CRS if necessary
    neighborhoods_2010 <- st_transform(neighborhoods_2010, crs = st_crs(df_sf))
    
    # Perform spatial join
    df_joined <- st_join(df_sf, neighborhoods_2010, left = TRUE)
    # Add neighborhood information to your original data frame
    df$NTA_2010 <- df_joined$NTAName  # Adjust based on actual column name
  • using the holidayNYSE() function to determine which dates are public holidays
    detailed code
    df$OCCUR_DATE <- as.Date(df$OCCUR_DATE, format = "%m/%d/%Y")
    
    # Define the range of years, handling NA values
    years <- seq(
      year(min(df$OCCUR_DATE, na.rm = TRUE)),
      year(max(df$OCCUR_DATE, na.rm = TRUE)),
      by = 1
    )
    us_holidays = holidayNYSE(years)
    us_holidays = as.Date(us_holidays)
    # Add a new column indicating whether the date is a holiday, year and month information of a date
    df = df %>%
      mutate(
        Is_Holiday = OCCUR_DATE %in% us_holidays,
        Year = year(OCCUR_DATE),
        Month = month(OCCUR_DATE)
      )
  • calculates dawn and dusk times for each unique date in the dataset. Using NYC’s coordinates (latitude and longitude), the getSunlightTimes() function generates dawn and dusk times for each day
    detailed code
    df$OCCUR_DATE <- as.character(df$OCCUR_DATE)
    df$OCCUR_TIME <- as.character(df$OCCUR_TIME)
    
    # Combine OCCUR_DATE and OCCUR_TIME into a single datetime string
    datetime_str <- paste(df$OCCUR_DATE, df$OCCUR_TIME)
    
    # Parse datetime using lubridate
    df$OCCUR_DATETIME <- ymd_hms(datetime_str, tz = "America/New_York")
    # Extract date from OCCUR_DATETIME
    df$DATE <- as.Date(df$OCCUR_DATETIME)
    
    # Get unique dates
    unique_dates <- unique(df$DATE)
    latitude <- 40.7128
    longitude <- -74.0060
    # Calculate dawn and dusk times
    sun_times <- getSunlightTimes(
      date = unique_dates,
      lat = latitude,
      lon = longitude,
      keep = c("dawn", "dusk"),
      tz = "America/New_York"
    )
    
    # Merge with the original dataframe
    df <- df %>%
      left_join(sun_times[, c("date", "dawn", "dusk")], by = c("DATE" = "date"))
    
    # Determine if the sky is dark considering twilight
    df <- df %>%
      mutate(
        Sky_Is_Dark = OCCUR_DATETIME < dawn | OCCUR_DATETIME >= dusk
      )
    df <- df %>%
      select(-DATE, -dawn, -dusk)
  • reads in population data from an Excel file, filters relevant rows, selects specific columns, and merges this population data with the original dataset.
    detailed code
    popu <- suppressWarnings(
      read_excel("./data/nyc_detailed-race-and-ethnicity-data_2020_core-geographies.xlsx", 
                 sheet = 1,
                 range = "A4:I2599",
                 col_types = c("numeric", "text", "text", "text", "numeric", "text", "text", "numeric", "numeric")) |>
        filter(`Orig Order` >= 2334 & `Orig Order` <= 2595) |>
        select(GeoName, NTAType, Pop) |>
        rename(Total_population = Pop)
    )
    
    df <- df |>
      left_join(popu, by = c("NTA" = "GeoName")) |>
      mutate(
        NTAType = case_when(
          NTAType == 0 ~ "Residential",
          NTAType == 9 ~ "Park",
          NTAType == 8 ~ "Airport",
          NTAType == 7 ~ "Cemetery",
          NTAType == 6 ~ "Other Special Areas",
          NTAType == 5 ~ "Rikers Island",
          is.na(NTAType) ~ "Unknown")
      )
  • add CDTA data similar to NTA
    detailed code
    df$X_COORD_CD <- as.numeric(df$X_COORD_CD)
    df$Y_COORD_CD <- as.numeric(df$Y_COORD_CD)
    
    df_sf <- st_as_sf(df, coords = c("X_COORD_CD", "Y_COORD_CD"), crs = 2263)
    
    cdta <- st_read("./data/nycdta2020_24d/nycdta2020.shp")
    
    cdta <- st_transform(cdta, crs = st_crs(df_sf))
    
    df_joined <- st_join(df_sf, cdta, left = TRUE)
    
    df$CDTA <- df_joined$CDTA2020
    
    df$CDTA <- gsub("([A-Z]{2})(\d{2})", "\1 \2", df$CDTA)
  • filters the poverty and education data for the relevant time period (2017-21) and geographic type (NTA2020), selects specific columns, and merges this poverty data with the shooting incident dataset.
    detailed code
    neighborhood_poverty <- read.csv("./data/neighborhood_poverty.csv")
    
    # Select specific columns from neighborhood_poverty
    neighborhood_poverty_selected <- neighborhood_poverty %>% 
      filter(TimePeriod == '2017-21') %>% 
      filter(GeoType == 'NTA2020' ) %>% 
      select(Number, Percent, Geography)
    # Merge the dataframes on 'NTA' and 'Geography', keeping all information from data_processing
    df_poverty <- left_join(df, neighborhood_poverty_selected, by = c("NTA" = "Geography"))
    
    # Rename columns after merging
    df_poverty <- df_poverty %>% 
      rename(Number_poverty = Number, Percent_poverty = Percent)
    neighborhood_education = read.csv("./data/graduated_high_school.csv")
    neighborhood_education_selected <- neighborhood_education %>% 
      filter(TimePeriod == '2017-21') %>% 
      filter(GeoType == 'NTA2020' ) %>% 
      select(Number, Percent, Geography)
    # Merge the dataframes on 'NTA' and 'Geography', keeping all information from data_processing
    df_education <- left_join(df_poverty, neighborhood_education_selected, by = c("NTA" = "Geography"))
    
    # Rename columns after merging
    df_education <- df_education %>% 
      rename(Number_education = Number, Percent_education = Percent)
    # filter between 2017 and 2023
    df_education$OCCUR_DATE <- as.Date(df_education$OCCUR_DATE, format = "%Y-%m-%d")
    
    # Filter the incidents that happened between 2017 and 2023
    df_education <- df_education %>% 
      filter(OCCUR_DATE >= as.Date("2017-01-01") & OCCUR_DATE <= as.Date("2023-12-31"))
  • calculates shooting incident rates for both Neighborhood Tabulation Areas (NTAs) and boroughs by year.
    detailed code
    #add a column that shows the shooting incident rate of NTAs in that year
    df_education <- df_education %>%
      group_by(NTA, Year) %>%
      mutate(incident_rate_by_year_nta = (n() / Total_population)*100) %>%
      ungroup()
    
    #add a column that shows the shooting incident rate of boroughs in that year
    boro_population <- df_education %>%
      group_by(BORO, Year) %>%
      summarise(total_population_boro = sum(Total_population, na.rm = TRUE)) %>%
      ungroup()
    
    df_education <- df_education %>%
      left_join(boro_population, by = c("BORO", "Year")) %>%
      group_by(BORO, Year) %>%
      mutate(incident_rate_by_year_boro = (n() / total_population_boro)*100) %>%
      ungroup()
    
    #Rename a column
    df_education <- df_education %>%
      rename(Total_population_nta = Total_population)
    
    write.csv(df_education, "data_final.csv", row.names = FALSE)

The detailed code is here

Exploratory data analysis

library(ggplot2)
library(readxl)
library(readr)
library(tidyverse)
library(dplyr)
library(sf)
library(plotly)
library(geojsonio)
library(knitr)
library(tidyr)
# Load the dataset
df_descriptive=read_csv("data_final.csv")
data_final <- read_csv("data_final.csv")
# Grouping the data by year to count the number of incidents
df_descriptive$Year <- as.factor(df_descriptive$Year)
year_counts <- as.data.frame(table(df_descriptive$Year))
colnames(year_counts) <- c('Year', 'Count')


# Plotting the line chart for shooting incidents by year with data points labeled
ggplot(year_counts, aes(x = Year, y = Count, group = 1)) +
  geom_line(color = 'blue') +
  geom_point(size = 3) +
  geom_text(aes(label = Count), vjust = -0.5, size = 3) +
  labs(title = 'Shooting Incidents by Year in NYC',
       x = 'Year',
       y = 'Number of Shooting Incidents') +
  theme_minimal()

The chart shows a stable trend in shooting incidents from 2017 to 2019, followed by a sharp increase in 2020, likely linked to socio-economic factors like the COVID-19 pandemic. Incidents peaked in 2021 and then declined through 2023.

sky_dark_counts <- df_descriptive %>%
  group_by(Sky_Is_Dark) %>%
  summarise(total_incidents = n())

ggplot(sky_dark_counts, aes(x = Sky_Is_Dark, y = total_incidents, fill = Sky_Is_Dark)) +
  geom_bar(stat = 'identity') +
  scale_x_discrete(labels = c("FALSE" = "Daytime", "TRUE" = "Nighttime")) + # Update x-axis labels
  scale_fill_manual(values = c("FALSE" = "skyblue", "TRUE" = "darkblue"),
                    labels = c("FALSE" = "Daytime", "TRUE" = "Nighttime")) +
  labs(title = 'Number of Shooting Incidents by time (Daytime vs. Nigttime)',
       x = 'Time of a day(Daytime vs. Nigttime)',
       y = 'Number of Shooting Incidents') +
  theme_minimal()

The number of shooting incidents when the sky was dark is significantly higher than when it was bright. This might indicate that shootings are more likely to occur during nighttime or low visibility conditions. The higher number of incidents during dark conditions could be due to factors such as reduced visibility, higher activity at night, or fewer people around, making it easier for incidents to occur undetected

# Table showing Top 10 total incident NTA in Each Borough
incident_by_nta_borough <- df_descriptive %>%
  drop_na() %>%
  group_by(BORO, NTA) %>%
  summarise(total_incidents = n()) %>%
  arrange(BORO, desc(total_incidents)) %>%
  group_by(BORO) %>%
  slice_max(n = 10, order_by = total_incidents) %>%
  select(-total_incidents) %>%
  group_by(BORO) %>%
  mutate(row_num = row_number()) %>%
  pivot_wider(names_from = BORO, values_from = NTA) %>%
  unnest(cols = c(BRONX, BROOKLYN, MANHATTAN, QUEENS, `STATEN ISLAND`)) %>%
  select(row_num, everything())%>%
  slice(1:10)

# Display the table in the desired format
kable(incident_by_nta_borough, caption = "Top 10 NTAs with Total Shooting Incidents in Each Borough (2017-2023)")
Top 10 NTAs with Total Shooting Incidents in Each Borough (2017-2023)
row_num BRONX BROOKLYN MANHATTAN QUEENS STATEN ISLAND
1 Mott Haven-Port Morris Brownsville East Harlem (North) Far Rockaway-Bayswater St. George-New Brighton
2 Mount Hope Bedford-Stuyvesant (East) Harlem (North) Baisley Park Tompkinsville-Stapleton-Clifton-Fox Hills
3 Tremont Crown Heights (North) East Harlem (South) Richmond Hill West New Brighton-Silver Lake-Grymes Hill
4 Mount Eden-Claremont (West) East New York-New Lots Harlem (South) Jamaica Mariner’s Harbor-Arlington-Graniteville
5 Concourse-Concourse Village East Flatbush-Remsen Village Washington Heights (South) Rockaway Beach-Arverne-Edgemere Annadale-Huguenot-Prince’s Bay-Woodrow
6 Williamsbridge-Olinville East New York (North) Chelsea-Hudson Yards St. Albans Port Richmond
7 Melrose Bedford-Stuyvesant (West) Hamilton Heights-Sugar Hill South Ozone Park Rosebank-Shore Acres-Park Hill
8 Belmont Canarsie Inwood Elmhurst Grasmere-Arrochar-South Beach-Dongan Hills
9 Longwood Coney Island-Sea Gate Chinatown-Two Bridges South Jamaica Great Kills-Eltingville
10 Fordham Heights East New York-City Line Washington Heights (North) Astoria (East)-Woodside (North) New Dorp-Midland Beach

This table allows for comparison of which neighborhoods within each borough have experienced the highest shooting incidents, and helps identify the differences in levels of gun violence among the boroughs of New York City.

# Plot: Description of Victims
victim_summary <- df_descriptive %>%
  filter(VIC_SEX %in% c("M", "F")) %>%  # Filter out unknown
  group_by(VIC_SEX, VIC_AGE_GROUP, VIC_RACE) %>%
  summarise(total_victims = n()) %>%
  drop_na()


# Plotting the description of male victims
ggplot(victim_summary %>% filter(VIC_SEX == "M"), aes(x = VIC_AGE_GROUP, y = total_victims, fill = VIC_RACE)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  labs(title = "Description of Male Shooting Victims by Age Group and Race",
       x = "Victim Age Group",
       y = "Number of Victims",
       fill = "Victim Race") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Plotting the description of female victims
ggplot(victim_summary %>% filter(VIC_SEX == "F"), aes(x = VIC_AGE_GROUP, y = total_victims, fill = VIC_RACE)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  labs(title = "Description of Female Shooting Victims by Age Group and Race",
       x = "Victim Age Group",
       y = "Number of Victims",
       fill = "Victim Race") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The bar charts show the distribution of male and female shooting victims by age group and race in NYC. In both charts, Black victims are the most affected across all age groups, with the highest number of victims in the 18-24 and 25-44 age groups. For males, the 25-44 age group shows a notable peak, while for females, the same age group also has the highest numbers. White Hispanic victims also show considerable numbers, particularly in the 25-44 age group.

# Plot: Bar chart showing victim's age
ggplot(df_descriptive %>% filter(!is.na(VIC_AGE_GROUP)& VIC_AGE_GROUP != 1022), aes(x = VIC_AGE_GROUP)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Bar Chart of Victim's Age",
       x = "Victim Age Group",
       y = "Count of Victims") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The 25-44 age group has the highest number of victims, significantly more than any other group. The 18-24 age group also shows a considerable number of victims, while the <18 and 45-64 age groups have notably fewer victims. The 65+ age group has the least number of victims.

Maps

Plot of Total Number of Incidents Across NYC BOROs (2017-2023)

# Count the boro incident
boro_incident_counts <- data_final %>%
  group_by(BORO) %>%
  summarise(Number_of_Incidents = n(), .groups = "drop")  %>%
  mutate(BORO = tolower(BORO) )

# Lowercase the boro in boro_shape
boro_shape = boro_shape %>%
  mutate(boro_name = tolower(boro_name))
 

# Merge spatial data with incident counts
boro_map_data <- boro_shape %>%
  left_join(boro_incident_counts, by = c("boro_name" = "BORO"))

data_final <- read_csv("data_final.csv")
boro_map_data <- boro_map_data %>%
  mutate(
    hover_text = paste("Borough:", boro_name, "<br>Total Incidents:", Number_of_Incidents)
  )

# Create the interactive plot with click functionality
plot <- plot_ly(
  data = boro_map_data,
  type = "scattermapbox",
  split = ~boro_name,  # Separate polygons by boroughs
  color = ~Number_of_Incidents,  # Color based on the number of incidents
  colors = "viridis",  # Use a color scale
  text = ~hover_text,  # Display hover text
  hoverinfo = "text",
  marker = list(size = 8, opacity = 0.7)
) %>%
  layout(
    title = "Total Number of Incidents Across NYC BOROs (2017-2023)",
    mapbox = list(
      style = "carto-positron",  # Base map style
      center = list(lon = -74.0060, lat = 40.7128),  # Center map on NYC
      zoom = 10
    )
  )

# Add click functionality to display the borough name and number of incidents
plot <- plot %>%
  event_register("plotly_click") %>%
  htmlwidgets::onRender("
    function(el, x) {
      el.on('plotly_click', function(d) {
        var point = d.points[0];
        var text = point.text;
        alert('You clicked on: ' + text);
      });
    }
  ")

# Display the interactive plot
plot

The map shows the total number of incidents across NYC boroughs (2017–2023), with each borough represented by a distinct color. A gradient is used to indicate the magnitude of incidents, with darker shades corresponding to higher counts, ranging from 0 to over 3,000 incidents. Each borough is outlined and filled with its respective color, making it easy to distinguish. The legend on the right identifies the boroughs (Bronx, Brooklyn, Manhattan, Queens, Staten Island) and aligns with the color gradient to show the number of incidents. This visualization highlights geographical disparities in incident frequency across the boroughs, aiding in understanding spatial distribution patterns.

Plot of Total Number of Incidents Across NYC CDTAs from 2017 to 2023

data_final$CDTA <- gsub(" ", "", data_final$CDTA)

cdta_incident_counts <- data_final %>%
  group_by(CDTA) %>%
  summarise(Number_of_Incidents = n(), .groups = "drop")

# Remove any trailing spaces or mismatches in CDTA identifiers:
cdta_shape$CDTA2020 <- gsub(" ", "", cdta_shape$CDTA2020)
data_final$CDTA <- gsub(" ", "", data_final$CDTA)

# Identify Missing Matches
unmatched_cdta <- setdiff(cdta_shape$CDTA2020, data_final$CDTA)

#Re-Merge the Data
cdta_map_data <- cdta_shape %>%
  left_join(cdta_incident_counts, by = c("CDTA2020" = "CDTA"))

# Update NA Handling
cdta_map_data <- cdta_map_data %>%
  mutate(
    Number_of_Incidents = ifelse(is.na(Number_of_Incidents), 0, Number_of_Incidents),
    Incident_Range = cut(
      Number_of_Incidents,
      breaks = seq(0, 600, by = 120), 
      labels = c("0-120", "121-240", "241-360", "361-480", "481-600"),
      include.lowest = TRUE
    )
  )

ggplot(data = cdta_map_data) +
  geom_sf(aes(fill = Incident_Range), color = "white", size = 0.2) +
 geom_sf_text(aes(label = Number_of_Incidents), size = 3, color = "black") +  # Add labels+
  scale_fill_manual(
    values = c(
      "0-120" = "#b2e2e2",
      "121-240" = "skyblue",
      "241-360" = "#66c2a4",
      "361-480" = "#2ca25f",
      "481-600" = "#006d2c"
    ),
    name = "Number of Incidents"
  ) +
  labs(
    title = "Total Number of Incidents Across NYC CDTAs from 2017 to 2023",
    subtitle = "Incidents grouped by range (0-600, 120 breaks)"
  ) +
  theme_minimal() +
  theme(
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank()
  )

The map shows NYC CDTA incidents (2017–2023) using a gradient from light blue (fewer incidents) to dark green (more incidents), with counts labeled on each district. Dark green areas highlight hotspots, likely in densely populated regions, while lighter blue areas, like Staten Island, show fewer incidents. This visualization helps identify trends and prioritize safety efforts.

Number of Incidents counted by CDTAs in each Boroughs from 2017 to 2023

boroughs <- unique(cdta_map_data$BoroName)
for (b in boroughs) {
    borough_data <- cdta_map_data %>%
        filter(BoroName == b)
    plot <- ggplot(data = borough_data) +
        geom_sf(aes(fill = Number_of_Incidents), color = "black") +
      geom_sf_text(aes(label = Number_of_Incidents), size = 3, color = "black") +  # Add labels
        scale_fill_gradientn(
      colors = c( "skyblue", "green", "yellow", "red"), # Custom color scale
      name = "Number of Incidents"
    ) +
        labs(
            title = paste("CDTA Incidents in", b),
            subtitle = "2017 to 2023",
            x = "Longitude",
            y = "Latitude"
        ) +
        theme_minimal()
    print(plot) 
}

The Bronx demonstrates significant variability in the number of incidents across its CDTAs, with central and southern regions experiencing the highest numbers of incidents, some exceeding 300. This highlights the borough as a hotspot for crime relative to others. Socioeconomic challenges, high population density, or structural inequities could contribute to these trends. Interventions should prioritize these high-incident areas by allocating additional law enforcement resources and community support programs to mitigate the causes of elevated crime levels.

Manhattan shows a relatively low overall number of incidents, except for a few northern neighborhoods, where one CDTA is a clear outlier with incidents surpassing 1200. This disparity suggests a concentration of crime in specific areas rather than borough-wide. The localized nature of the issue could be due to unique socioeconomic pressures or demographic factors in those neighborhoods. Targeted initiatives such as community policing or economic investment in northern Manhattan could help address these localized hotspots.

Queens generally exhibits a safer profile with most CDTAs reporting fewer than 100 incidents. However, one central area stands out with more than 400 incidents, marking it as a concern. The low number of incidents elsewhere reflects the suburban and less dense nature of the borough. Efforts in Queens should focus on analyzing the central hotspot to determine the underlying causes, such as potential economic or social stressors, and implementing programs to improve community safety.

Brooklyn presents a widespread distribution of incidents, with several CDTAs exceeding 500 incidents. The borough shows higher overall numbers compared to Queens and Manhattan, which may be linked to its high population density and mixed socioeconomic profile. The more even distribution of incidents suggests a need for borough-wide interventions rather than focusing on isolated areas. A combination of public safety initiatives, housing improvements, and community programs would be essential for addressing crime across the borough.

Staten Island stands out with the lowest number of incidents among the five boroughs. Most CDTAs report fewer than 50 incidents, with one area as an exception, exceeding 200 incidents. This aligns with Staten Island’s suburban and less densely populated character, which likely contributes to its lower crime rates. Maintaining this safety level requires continued focus on community engagement and ensuring sufficient resources are available to address any emerging hotspots.

Statistical analysis

Method for Calculating Correlation Coefficients

To assess potential linear relationships between socioeconomic variables and shooting rates, correlation coefficients were calculated. Specifically, the relationship between poverty rates and shooting rates, as well as between the percentage of high school graduates and shooting rates, was explored across different neighborhoods in NYC.

Poverty

Calculate the correlation between the poverty percentage and the incident rate.

correlation <- cor(data_clean$incident_rate_by_year_nta, data_clean$Percent_poverty, use = "complete.obs")
print(paste("Correlation coefficient: ", correlation))
## [1] "Correlation coefficient:  0.508408410575774"
data_clean %>%
  plot_ly(x = ~Percent_poverty, y = ~incident_rate_by_year_nta, 
          color = ~NTA, colors = "viridis", 
          type = "scatter", mode = "markers",
          text = ~paste("Neighborhood: ", NTA, "<br>Borough: ", BORO, 
                        "<br>% Below Poverty Line: ", Percent_poverty, 
                        "<br>Incident Rate: ", incident_rate_by_year_nta)) %>%
  layout(title = "Percent Below the Poverty Line and Incident Rate in NYC",
         xaxis = list(title = 'Percentage of People Below the Poverty Line'),
         yaxis = list(title = 'Incident Rate'),
         legend = list(title = list(text = 'Neighborhood')))
# Scatter plot for Brooklyn
data_clean |> 
  filter(neighbourhood_group == "Brooklyn") |> 
  plot_ly(data = _, x = ~Percent_poverty, y = ~incident_rate_by_year_nta, 
          color = ~NTA,
          colors = "plasma", 
          type = "scatter",
          mode = "markers",
          text = ~paste("Neighborhood: ", NTA, "<br>Borough: ", neighbourhood_group, 
                        "<br>% Below Poverty Line: ", Percent_poverty, 
                        "<br>Incident Rate: ", incident_rate_by_year_nta)) |> 
    layout(title = "Percent Below the Poverty Line and Incident Rate in Brooklyn",
           xaxis = list(title = 'Percentage of People Below the Poverty Line'),
           yaxis = list(title = 'Incident Rate'),
           legend = list(title = list(text = 'Neighborhood')))
# Scatter plot for Staten Island
data_clean |> 
  filter(neighbourhood_group == "Staten Island") |> 
  plot_ly(data = _, x = ~Percent_poverty, y = ~incident_rate_by_year_nta, 
          color = ~NTA,
          colors = "inferno", 
          type = "scatter",
          mode = "markers",
          text = ~paste("Neighborhood: ", NTA, "<br>Borough: ", neighbourhood_group, 
                        "<br>% Below Poverty Line: ", Percent_poverty, 
                        "<br>Incident Rate: ", incident_rate_by_year_nta)) |> 
    layout(title = "Percent Below the Poverty Line and Incident Rate in Staten Island",
           xaxis = list(title = 'Percentage of People Below the Poverty Line'),
           yaxis = list(title = 'Incident Rate'),
           legend = list(title = list(text = 'Neighborhood')))
  • Across all neighborhoods in NYC, there is a moderate positive linear relationship between the percentage of people below the poverty line (Percent_poverty) and the incident rate by neighborhood (incident_rate_by_year_nta).(r = 0.508).
  • Associations differ by borough.
  • Notably, Brooklyn (0.5686) and Staten Island (0.5576) show the highest correlations, suggesting that the relationship between poverty and incident rate is stronger in these boroughs.

Education

Calculate the correlation between the graduated in highschool percentage and the incident rate

# Calculate the correlation between the graduated in highschool percentage and the incident rate
correlation <- cor(data_clean$incident_rate_by_year_nta, data_clean$Percent_education, use = "complete.obs")
print(paste("Correlation coefficient: ", correlation))
## [1] "Correlation coefficient:  -0.274782643621427"
# Create a scatter plot to visualize the relationship
data_clean %>%
  plot_ly(x = ~Percent_education, y = ~incident_rate_by_year_nta, 
          color = ~NTA, colors = "viridis", 
          type = "scatter", mode = "markers",
          text = ~paste("Neighborhood: ", NTA, "<br>Borough: ", BORO, 
                        "<br>% graduated HS: ", Percent_education, 
                        "<br>Incident Rate: ", incident_rate_by_year_nta)) %>%
  layout(title = "Percent graduated high school and Incident Rate in NYC",
         xaxis = list(title = 'Percentage of People graduated in high school'),
         yaxis = list(title = 'Incident Rate'),
         legend = list(title = list(text = 'Neighborhood')))
# Scatter plot for The Bronx
data_clean |> 
  filter(neighbourhood_group == "Bronx") |>
  plot_ly(data = _, x = ~Percent_education, y = ~incident_rate_by_year_nta, 
          color = ~NTA,
          colors = "magma", 
          type = "scatter",
          mode = "markers",
          text = ~paste("Neighborhood: ", NTA, "<br>Borough: ", neighbourhood_group, 
                        "<br>% graduated HS: ", Percent_education, 
                        "<br>Incident Rate: ", incident_rate_by_year_nta)) |> 
   layout(title = "Percent graduated high school and Incident Rate in The Bronx",
           xaxis = list(title = 'Percentage of People graduated in high school'),
           yaxis = list(title = 'Incident Rate'),
           legend = list(title = list(text = 'Neighborhood')))
# Scatter plot for Staten Island
data_clean |> 
  filter(neighbourhood_group == "Staten Island") |>
  plot_ly(data = _, x = ~Percent_education, y = ~incident_rate_by_year_nta, 
          color = ~NTA,
          colors = "inferno", 
          type = "scatter",
          mode = "markers",
          text = ~paste("Neighborhood: ", NTA, "<br>Borough: ", neighbourhood_group, 
                        "<br>% graduated HS: ", Percent_education, 
                        "<br>Incident Rate: ", incident_rate_by_year_nta)) |> 
   layout(title = "Percent graduated high school and Incident Rate in Staten Island",
           xaxis = list(title = 'Percentage of People graduated in Staten Island'),
           yaxis = list(title = 'Incident Rate'),
           legend = list(title = list(text = 'Neighborhood')))
  • Across all neighborhoods in NYC,there a negative linear relationship between high school graduation rates and incident rates.(r = -0.2748)
  • Associations differ by borough.
  • Notably, The Bronx (-0.3996) and Staten Island (-0.6452) show the highest correlations, suggesting that the relationship between percentage of people who graduated from high school and incident rate is stronger in these boroughs.

Discussion

In conclusion,our exploratory analyses reveals key insights into the distribution of shooting incidents in New York City, their incident rates, and their relationship with sociodemographic factors, particularly poverty and education levels. This research will aid in informing targeted public health and safety interventions by identifying the communities most impacted by shootings. Further discussion of our results for specific analyses are below.

Shooting Incident Distribution

The spatial distribution of shooting incidents in NYC shows notable concentration in certain Neighborhood Tabulation Areas (NTAs), particularly in the Bronx, Brooklyn, and Manhattan. Specifically, neighborhoods like Mott Haven-Port Morris in the Bronx and Brownsville in Brooklyn show consistently high rates of shooting incidents. Staten Island, by comparison, shows lower rates of shooting incidents, although neighborhoods like St. George-New Brighton are more heavily impacted within the borough.

The temporal analysis also highlights a spike in incidents during 2020, likely linked to socio-economic disruptions caused by the COVID-19 pandemic. This peak was followed by a decline through 2023, suggesting that some return to normality or enhanced public safety measures may have played a role in the reduction of shootings. Furthermore, incidents are more prevalent during periods of darkness, which might reflect reduced visibility, fewer people around, and greater opportunities for undetected actions.

Incident Rates and Sociodemographic Correlation

The analysis of incident rates in different NTAs and their correlation with sociodemographic factors reveals several important trends:

Poverty and Shooting Rates:

The correlation between poverty levels and shooting rates is moderately positive (r = 0.508), indicating that neighborhoods with higher poverty levels tend to have more shooting incidents. This pattern was particularly strong in Brooklyn (0.5686) and Staten Island (0.5576), suggesting that economic hardships might exacerbate vulnerability to gun violence in these areas.

Education and Shooting Rates:

The analysis showed a negative correlation (-0.2748) between high school graduation rates and shooting incidents, implying that neighborhoods with higher educational attainment tend to experience fewer shootings. Notably, the relationship was strongest in The Bronx (0.3996) and Staten Island (-0.6452), where increased high school graduation rates are associated with a marked decrease in shooting incidents. However, this trend was less evident in Manhattan, suggesting that factors beyond education may be more influential in certain areas.

Sociodemographic Impact

Our findings suggest that socio-economic conditions play a substantial role in the spatial and temporal distribution of shootings in NYC. Neighborhoods with lower educational attainment and higher poverty rates are disproportionately impacted by gun violence. This highlights the importance of targeted interventions focused on improving socio-economic conditions, particularly education and employment opportunities, as means of mitigating gun violence.

The disparity among boroughs also suggests that local contexts and community-specific dynamics are key in understanding gun violence. For example, while poverty was a significant factor across the city, its impact varied, with Staten Island and Brooklyn showing a stronger relationship with shooting rates. Similarly, education had a stronger protective effect in Staten Island and the Bronx.

Limitations

While this study provides an initial exploration of shooting patterns and their relationship to sociodemographic factors, there are several limitations. First, the use of correlational analysis means that we cannot establish causation. Furthermore, we may have missed important confounders, such as policing practices or the presence of gang activity, which could influence shooting incidents. Future studies should incorporate these factors and explore causal relationships, possibly using more advanced statistical methods, such as multilevel models or causal inference techniques.